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Abstract 

Generalized estimating equation (GEE) algorithm under a heterogeneous residual variance model is an extension of the 
iteratively reweighted least squares (IRLS) method for continuous traits to discrete traits. In contrast to mixture model-based 
expectation-maximization (EM) algorithm, the GEE algorithm can well detect quantitative trait locus (QTL), especially large 
effect QTLs located in large marker intervals in the manner of high computing speed. Based on a single QTL model, 
however, the GEE algorithm has very limited statistical power to detect multiple QTLs because of ignoring other linked 
QTLs. In this study, the fast least absolute shrinkage and selection operator (LASSO) is derived for generalized linear model 
(GLM) with all possible link functions. Under a heterogeneous residual variance model, the LASSO for GLM is used to 
iteratively estimate the non-zero genetic effects of those loci over entire genome. The iteratively reweighted LASSO is 
therefore extended to mapping QTL for discrete traits, such as ordinal, binary, and Poisson traits. The simulated and real 
data analyses are conducted to demonstrate the efficiency of the proposed method to simultaneously identify multiple 
QTLs for binary and Poisson traits as examples. 
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Introduction 

Corresponding to continuous and discrete random variables in 
statistics, quantitative traits are classified into continuous and 
discrete traits in quantitative genetics. In contrast to discrete traits, 
continuous traits especially normally distributed ones are analyzed 
by taking advantage of the extensively developed inference 
methods available for linear models. Actually, mapping methods 
for continuous quantitative traits are developed prior to discrete 
traits. The earliest QTL mapping for continuous traits can be 
traced back to the interval mapping developed by Lander and 
Botstein [1], while the first group of people to map ordinal traits 
using the EM algorithm is credited to Hackett and Weller [2] and 
Xu and Atchley [3] . 

Binary and categorical discrete traits are commonly observed 
that typically follow binomial and multinomial distributions. A 
binary trait including only two categories is a special case of 
categorical or ordinal trait. Also, binomial trait and multinomial 
trait can be regarded as the derivatives of binary and categorical or 
ordinal traits, defined by the proportions of the number of events 
happened among the total number of trials. Traits measured as 
counts are often called Poisson traits because they are usually 
modeled by a Poisson distribution. The generalized linear model 
(GLM) therefore becomes a natural choice for analyzing the 
discrete traits with the above mentioned distributions [4,5]. Some 
applications of GLM to mapping QTLs have been conducted for 
binary traits [3,6,7], ordinal traits [2,8] and Poisson traits [9,10]. 



The IRLS for the normally distributed traits was extended to 
analyze binary traits [11-13], which gready improves the 
computational efficiency with littie loss in power. Especially, the 
EM algorithm within the framework of GLM have been 
developed to simplify QTL mapping for binary traits and ordinal 
traits [14,15]. In addition, the GEE approach for GLM has been 
adopted to comprehensively analyze multiple mixed traits of 
continuous and discrete trait components [16]. 

Based on the interval mapping with GLM models, a set of 
mapping methods [7,17-19] have been developed to simulta- 
neously map multiple QTLs for discrete traits. As an alternative, 
Bayesian mapping method with reversible-jump MCMC sampling 
[20] was proposed to infer QTLs for binary traits [7]. 
Subsequently, Yi et al. [18] applied a stochastic search variable 
selection method for Bayesian mapping of ordinal traits which 
remarkably improved the sampling efficiency for model param- 
eters. By fitting a continuous prior distribution on genetic effects, 
most recently, hierarchical generalized linear models and compu- 
tationally efficient algorithms have been further developed for 
genome-wide analysis of QTL for various types of phenotypes in 
experimental crosses [19]. In Bayesian mapping, only the method 
with reversible-jump MCMC sampling belongs to fully Bayesian. 
However, the reversible-jump MCMC sampling is usually subject 
to poor mixing. Some Bayesian methods use imputed QTL 
genotypes based on the conditional expectations of the genotype 
given on the flanking marker information, but ignores the 
uncertainties of the imputation process. Unfortunately, no efficient 
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Table 1. The commonly used distributions in the GLM for discrete traits. 
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and convincing method for statistical inference of QTLs is 
provided. 

As an extension of the IRLS method for continuous traits to 
discrete traits, GEE algorithm under a heterogeneous residual 
variance model can well characterize QTLs, especially large QTLs 
located in large marker intervals, in the manner of high computing 
speed [21]. Such a method has been developed for the interval 
mapping of discrete traits. Thus, it has a limited statistical power to 
identify multiple QTLs, without considering other linked QTLs. 
The objectives of this study is to derive a penalized generalized 
linear model with LASSO penalty, and then, to iteratively estimate 
the non-zero genetic effects of those loci over the entire genome, 
under a heterogeneous residual variance model. The iteratively 
reweighted LASSO algorithm [22] is extended to QTL mapping 
for discrete traits such as ordinal, binary, and Poisson traits. 



phenotype has a linear predictor: 



E(y i ) = n i = b 0 +^x i jb j 



(2) 



In many situations, however, a trait in question are measured in 
discrete form, including binary, binomial, categorical, multinomial 
and Poisson traits. Their distributions, summarized in Table 1, 
belong to an exponential distribution family with different link 
functions. The relationship between the mean of a discrete 
phenotype and the linear predictor of genetic effects of k loci (jj) 
is formulated by means of a link function, denoted by 



Methods 

Generalized linear genetic model 

In a QTL mapping analysis, a genetically designed population is 
required to construct linkage relationship between putative QTL 
and markers. In such a population, n individuals are observed for 
phenotypic values and are genotyped for reasonably dense co- 
dominant markers. A genetic linkage map can be constructed 
based on the observed markers or can be obtained from prior 
studies. Based on the linkage information, QTLs are identified by 
inferring the significance of genetic effects for loci on or between 
markers. 

If the trait of interest is normally distributed, the effects of these 
loci on phenotype are generally represented by the following linear 
model: 



= b 0 + V v.A • ;:. 
7 = 1 



where y -, for ( = 1,2, • • ■ ,n is the phenotypic value of the normal 
trait; bo is the population mean; k is the number of putative loci; 
bjis the genetic effect at the jth locus; Xy is the indicator variable 
determined by the genotypes of the jth locus. Note that only of the 
additive genetic effect is considered here for the simplification of 
description, which is suitable for backcross, double haploid and 
recombinant inbred lines. 

In model (1), the residual error 8; is assumed to follow a normal 
distribution with N(Q,a 2 ). Hence, the expectation of a normal 



7=1 

k 

or Lt i =g~ 1 ('h)=g~ l ( h o+ J2 Xi i b i^ 
7=1 



(3) 



for ( = 1,2, • • ■ ,n. 

This is the GLM for multiple QTL mapping with discrete traits. 
In the model, g is the link function. Moreover, the variance of 
discrete phenotype V(yi) can be derived from the distribution of 
each discrete trait (Also shown in Table 1), which is useful for 
estimating model parameters. 

Genetic effect estimation 

Theoretically speaking, the reweighted least square method by 
Wedderburn [5] can be used to estimate the parameters in model 
(3). But implementation of this method is not straightforward, due 
to the fact that the number of parameters estimated may be far 
greater than sample size and the values of indicator variables are 
missing at loci between markers. According to the least squares 
method of Haley and Knott [23], the missing values of indicator 
variables can be simply replaced by its expectation of conditional 
probability given on flanking marker genotypes. This replacement, 
however, could result in over dispersion (Xu 1998). From the 
linear predictor in model (3), the over-dispersion is calculated as 
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■J2 h j^ ar ( x ij) + 2 Y1 £ b i b j ,Cov(x ih x ij ,) + <j> 

7=1 j = 



(4) 



where Var(xy) and Cov(Xy,x,y/) depend on genetically designed 
population, as derived in Liu et al. [22]; ^ e'' is the parameter of 
dispersion in an exponential family. To adjust for the heteroge- 
neity of over-dispersion [21], the linear predictor is standardized 
by the over-dispersion parameter, which lead to 



~ ( b o+Y. Xi J h j 



7 = 1 



(5) 



Instead of rj h the standardized linear predictor f/, is substituted 
into model (3). As in the reweighted least square method for a 
generalized linear model, it is defined that 
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By Taylor expansion, the log-likelihood about model parame- 
ters is quadraticalfy approximated by 



(9) 



with v, = 



and x'a 



With more model parameters than sample size, the unique 
estimates of genetic effects can not be obtained by minimizing the 
log-likelihood above. Actually, there are few non-zero and 
significant genetic effects in model (3), because the number of 
QTLs for a trait is generally not large. In this case, the LASSO 
penalized method with a coordinate descent step can efficiently 
shrink most of genetic effects to zeros by minimizing the following 
function [24,25]: 



E wMi-V(bo- 



7=1 . 
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(10) 



where k is a tuning parameter which will be chosen by cross 
validation. 

In solving model parameters with LASSO, iterations are 
required, as response variable independent variables v, and 
x jj as well as weighted value W; are all a function of the estimated 
parameters. Without heterogeneous over-dispersion, the software 
R/ glmnet can be applied to efficiently search for sparse solutions 
in the oversaturated GLM model [25]. Taking the glmnet as the 
inner loop, the iterative procedure is implemented in the following 
steps: 
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Figure 1. The profiles of -log(p) test statistics of additive genetic effects obtained with IRglmnet method (upper panel) and IRGEE 
method (lower panel) for alopecia areata. In each plot, the genome-wide critical value is marked by a horizontal reference line. Chromosomes 
are separated by the vertical dotted lines and marker positions are indicated by the ticks on the horizontal axis. 
doi:1 0.1 371 /journal.pone.01 06985.g001 



(1) Initialize c, = 1 and all genetic effects as zeros 

(2) Shrink genetic effects with the unweighted glmnet 

(3) Update a, using non-zero genetic effects 

(4) Shrink genetic effects with the iteratively weighted glmnet 

(5) Repeat step 2 and 4 until certain convergence criteria are 
satisfied. 



Statistical inference for QTLs 

The LASSO for oversaturated GLM can provide estimates of 
non-zero genetic effects, but cannot do significance test for the 
estimates. After shrinkage estimation, the number of non-zero 
effects is generally less than the sample size. Substituting for the 
glmnet in iterative procedure, therefore, the reweighted least 
squares method for common GLM can be employed to estimate 
non-zero genetic effects: 
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The variance-covariance matrix of the model parameters is 
estimated by 
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(12) 



The t test statistic is used to infer the significance of the non-zero 
effects, which is calculated as 



for 1 = 1,2, • • • ,k 



V(bi) 



(13) 



It needs to be specially noticed that genetic effects re-estimated 
by reweighted least square method may be biased upward due to 
high variable selection of LASSO above. Meanwhile, population 
structure and marker density influence distribution of the / test 
statistic still. Permutation tests [26] is therefore introduced to 
adjust the critical value of I test statistic. The loci corresponding to 
significant genetic Meanwhile, population structure and mark 
effects are determined as the QTLs for trait of interest. 
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Figure 2. The profiles of -log(p) test statistics of dominance genetic effects obtained with IRglmnet method (upper panel) and 
IRGEE method (lower panel) for alopecia areata. In each plot, the genome-wide critical value is marked by a horizontal reference line. 
Chromosomes are separated by the vertical dotted lines and marker positions are indicated by the ticks on the horizontal axis. 
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Simulations 

The purpose of simulation is to demonstrate the efficiency of the 
method proposed here (IRglmnet for short), by comparing it to the 
GEE (IRGEE for short) algorithm under a heterogeneous residual 
variance model and unweighted glmnet (UWglmnet for short). Six 
chromosomes, each of length 1 00 cM with 1 1 evenly placed co- 
dominant markers are simulated in a backcross population with a 
sample size of 200 and 400. Total 10 QTLs are simulated on the 6 
chromosomes, whose positions and genetic effects are listed in 
Table 2 and Table 3. Assuming population mean to be zero, the 
expectation 17, is calculated based on the simulated genotypes of 
the QTLs and the expectation jU, of discrete traits based on the link 
function. Taking binary and Poisson traits as examples, pheno- 
typic values are randomly generated from binomial and Poisson 
distributions with their known expectations. 

Alopecia areata in mouse 

To locate QTLs linked with alopecia areata, an F 2 population 
was generated from crossing the strain of C3H/HeJ and C57BL/ 
6J mice [27]. The 138 alopecia areata and 214 clinically normal 
mice were genotyped at 1 2 months of age using 2 1 1 microsatellite 
markers. Linkage maps and marker positions were reported on the 
website www.informatics.jax.org. 

Tiller numbers in rice 

This is a Poisson dataset for mapping QTL of tiller numbers in 
rice [28]. A doubled-haploid (DH) population of 123 lines was 
derived from the cross between two inbred lines, semidwarf IR64 
and tall Azucena [29] . Based on this population, a genetic linkage 
map of 2005 cM long was constructed using 1 75 genetic markers. 



For the 123 DH lines, each containing five plants, tiller numbers 
were observed every 10 days until all lines had headed. Phenotypic 
value of each line was obtained by averaging over five plants. 

Results 

Simulation study 

The simulated datasets are analyzed by using IRglmnet, 
UWglmnet and IRGEE. For convenience to compare the three 
mapping methods, all test statistics are transformed to -log(p), 
where p is the probability of greater than the realized statistic 
values. The critical values of the test statistic are determined 
through simulating 1000 samples under the null model with zero 
genetic effects. They are slightly distinguishable among the three 
mapping methods and two sample sizes (Results not shown). The 
simulations are replicated 500 times for estimating QTL 
parameters and assessing the statistical power of QTL detection. 
Statistical power of QTL detection is counted by each locus as the 
percentage of the number of those simulations that statistic value 
exceeds critical value at the locus. 

The statistical performances of different scenarios are presented 
in Table 2 for position estimate comparison, in Table 3 for QTL 
parameter estimate comparison and in Table 4 for power 
comparison, As can be seen, the IRglmnet is mostly identical to 
the UWglmnet and the two glmnet methods are advantageous to 
IRGEE in terms of power to detect QTL. The IRglmnet methods 
can accurately estimate QTL genetic effects, while UWglmnet 
somewhat underestimates and IRGEE can not estimate well QTL 
effects. Meanwhile, both glmnets are able to detect QTL positions 
with higher precision than IRGEE. Under the same genetic design 
and sample size, the two traits analyzed are evidently distinct in 



PLOS ONE I www.plosone.org 



6 



September 2014 | Volume 9 | Issue 9 | e1 06985 



Mapping Discrete Trait Loci with LASSO 



QTL parameter estimation and power of QTL detection at the 
QTLs of small genetic effects. As compared to normally 
distributed traits in Liu et al. [22], the statistical powers to detect 
QTL for the two analyzed traits are higher at all the simulated 
QTLs except for the QTLs of small genetic effects, even with large 
sample size. This implies that in principle, it is generally difficult to 
detect QTLs for discrete traits. In addition, the statistical 
performance of each mapping method increases as sample size 
and QTL genetic effect increase, as observed in usual QTL 
mapping. 

The advantage of our proposed method lies in the computing 
efficiency as well. Although the IRglmnet method with cross 
validation takes more computing time than the UWglmnet, the 
two glmnet methods run considerably fast, as compared to the 
IRGEE. On an Intel core 4 PC with a 3.8 GHz processor, the 
UWglmnet and IRglmnet for binary data consume 10.8 seconds 
and 24.5 seconds on average, respectively, whereas the IRGEE 
takes 1.2 minutes under a sample size of 200. For Poisson data, the 
UWglmnet, IRglmnet and IRGEE run 14.3 seconds, 33.6 seconds 
and 1.8 minutes, respectively. The difference in computing time- 
consuming gets larger as the sample size increases. 

Mapping QTL for alopecia areata 

In an F 2 population, there are three genotypes at each locus, 
denoted by QQ Qq and qq, so that QTL genetic effect can be 
partitioned into additive and dominance effects. The linear 
predictor for alopecia areata traits is described as 



where, fi is the population mean, a ; is the additive effect at thejth 
locus, Zy is the indicator variable corresponding to the additive 
effect, defined as + 1 for QQ 0 for Qq and — 1 for qq. dj is the 
dominance effect at the jth locus, it'// is the indicator variable 
corresponding to the dominance effect, defined as 0 for 
homozygote and 1 for heterozygote. 

With probit link function, the dataset is analyzed by using 
IRglmnet, UWglmnet and IRGEE methods. The genome-wide 
critical threshold values for declaring QTL significance are 
obtained by using 1000 permutation tests. The critical values are 
distinguishable between the two glmnets methods and IRGEE 
method, which is marked by horizontal reference line in Figure 1 
and Figure 2. The comparative plots in the profiles of -log(p) 
statistics between IRglmnet and IRGEE methods are depicted in 
Figure 1 and Figure 2 by the mode of inheritance. The over- 
dispersion parameter for each individual is much closed to 1 in 
running IRglmnet method, so that the results obtained with the 
UWglmnet are exactly the same as those obtained with the 
IRglmnet. The QTLs are generally determined according to the 
peaks that exceed corresponding critical values. As can be seen 
from the two Figures, the IRglmnet finds not only those QTLs 
detected with IRGEE but also more QTLs than those detected 
with IRGEE. Surprisingly, most of QTLs are located on markers 
in the genetic map with moderate density. 

Table 5 tabulates parameter estimates of QTLs detected with 
three mapping methods. A total of 10 QTLs are identified for 
alopecia areata, of which, 6 are inherited in the additive mode and 
4 in the dominance mode. Interestingly, the QTLs on chromo- 
some 1 and chromosome 8 are completely different in the mode of 
inheritance between the two glmnet methods and IRGEE method. 
The proportions of phenotypic variation explained by the 
detectable QTLs varied from 2% to 49%. The largest heritability 
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Figure 3. The profiles of -log(p) test statistics obtained with IRglmnet method (upper panel) and IRGEE method (lower panel) for 
tiller numbers. In each plot, the genome-wide critical value is marked by a horizontal reference line. Chromosomes are separated by the vertical 
dotted lines and marker positions are indicated by the ticks on the horizontal axis. 
doi:1 0.1 371 /journal.pone.01 06985.g003 



(49%) is of the QTL on chromosome 17, which is more than five 
times of the second highest heritability (9%). The estimates for 
genetic effects from the IRglmnet are twice more than those from 
the UWglmnet, but their estimated heritabilies are roughly the 
same, except for the largest heritability. Meanwhile, the three 
mapping methods are able to consistently detect major locus on 
mouse chromosome 17 and minor locus on chromosome 9, as 
reported in Sundberg et al. [27]. 

Mapping QTL for tiller numbers 

Taking the tiller numbers at the third stage (30 days after 
transplanting) as an example, the QTLs for the traits are located 
by using the three competing mapping methods. Figure 3 (upper 
panel) illustrates that there are two peaks that pass through the 
horizontal line of critical value 2.043 at 5% genome-wide 
significant level. This suggests that the two QTLs are indentified 
by the IRglmnet and the UWglmnet. One of the detected QTLs is 
located between markers MK23 and MK24 on chromosome 2, 
another between markers MK48 and MK49 on chromosome 3. 
They explain 1.9% and 2.1% of phenotypic variance, respectively. 
There is almost no difference in the estimated heritability between 
the UWglmnet and the IRglmnet, but the estimates for genetic 
effect obtained with UWglmnet are less than that with IRglmnet. 
In contrast, the IRGEE finds only one QTL of those identified by 
both glmnet methods, as shown in Figure 3 (lower panel). 



Discussion 

Two extensions are realized to map QTL for discrete traits: one 
is that of the GEE algorithm under a heterogeneous residual 
variance model by Xu and Hu [21] for a single QTL model to 
multiple QTL model, and another is that of IRLASSO for the 
continuous normal traits [22] to discrete ones. The glmnet with 
coordinate descent step is used for fast estimation of non-zero 
effects, followed by few non-zero genetic effects estimated and 
statistically inferred with a regular GLM. Like regular interval 
mapping, the method proposed here can, not only estimate QTL 
effects, but also assess the significance of QTLs. Although our 
mapping method is developed for improving linkage analysis with 
low marker density, it is also appropriate for missing genotypes 
that always happen in QTL mapping. 

R/glmnet can efficiently fit binary, categorical and Poisson data 
with logistic and Poisson regression models [22]. However, it can 
not be used to directly analyze binomial and multinomial data and 
only logit and log link functions are used in solving for the 
oversaturated GLM. In this study, a general LASSO procedure for 
the GLM is derived based on all possible link function for discrete 
traits, which can be incorporated into the R/glmnet with little 
modification. From a statistical viewpoint, the choice for link 
function can be somewhat arbitrary, but it is necessary to 
understand the biological meaning of discrete traits. For instance, 
the biological mechanism of binary and categorical traits can be 
well interpreted by the threshold model with the probit link 
function. 
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Our proposed method can be simplified to a notable extent in 
the practice of QTL mapping. As demonstrated in simulations, the 
UWglmnet is mostly consistent with IRglmnet method in terms of 
power to detect QTL, although it underestimates the genetic 
effects of QTLs. By removing the iteratively weighted step for 
IRglmnet, non-zero effects can be solved by using UWglmnet for 
simple computation. The iteratively weighted step is only used in 
GLM analysis for reestimating few non-zero genetic effects. In R/ 
glmnet, additionally, cross validation (CV) is always introduced to 
improve shrinking efficiency. The fewer non-zero genetic effects 
are retained after the CV.glmnet, but more computing time is 
needed. Actually, the CV step can be ignored in the QTL 
mapping, as the CV.glmnet just drops the redundant non-zero 
genetic effects of those obtained with the glmnet without CV. Prior 
to the GLM analysis, the redundant non-zero genetic effects solved 
by using the glmnet without CV can be simply removed by 



restricting that there is only one QTL within a marker interval. 
Without the iteratively weighted step and cross-validation for 
glmnet, our proposed mapping method can save more computing 
time than the GEE algorithm under a heterogeneous residual 
variance model by Xu and Hu [21]. Our proposed method has 
been coded in the program on the IRLASSO [22], which can 
handle normally distributed, binary, binomial, ordinal, multino- 
mial and Poisson traits. The program is freely available upon 
request from the authors. 
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